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Abstract 

It has been argued persuasively that, in order to evaluate climate models, the prob- 
ability distributions of model output need to be compared to the corresponding 
empirical distributions of observed data. Distance measures between probability 
distributions, also called divergence functions, can be used for this purpose. We 
contend that divergence functions ought to be proper, in the sense that acting on 
modelers' true beliefs is an optimal strategy. Score divergences that derive from 
proper scoring rules are proper, with the integrated quadratic distance and the 
Kullback-Leibler divergence being particularly attractive choices. Other commonly 
used divergences fail to be proper. In an illustration, we evaluate and rank sim- 
ulations from fifteen climate models for temperature extremes in a comparison to 
re-analysis data. 



1 Introduction 

Traditionally, climate models have been assessed by comparing summary statistics or 
point estimates that derive from the simulated model output to the corresponding ob- 
served quantities [HI [17] . However, there is a growing consensus that in order to evaluate 
climate models the probability distribution of model output needs to be compared to the 
corresponding distribution of observed data. Guttorp [I0| p. 820] argues powerfully that 

"Climate models are difficult to compare to data. Often climatologists compute some sum- 
mary statistic, such as global annual mean temperature, and compare climate models using 
observed (or rather estimated) forcings to the observed (or rather estimated) temperatures. 
However, it seems more appropriate to compare the distribution (over time and space) of 
climate model output to the corresponding distribution of observed data, as opposed to 
point estimates with or without confidence intervals." 

Palmer [19^ p. 850] joins this argument, by proposing the use of the Hellinger distance 
between simulated and observed probability distributions to assess climate models. A 
pioneering effort in these directions is that of Perkins et al. |2U] who use a transportation 
type metric; however, this metric applies to densities only and thus requires density 
estimates. 
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Taking a broader perspective, predictive models are key tools in many realms of 
science and society, and principled methods for assessing their performance are in strong 
demand. Any such evaluation procedure ought to provide a quantitative assessment of 
the compatibility of the simulation model, and the real world phenomena it is meant to 
represent, in a manner that encourages careful assessment and integrity. In this paper, 
we focus on the setting in which the simulation model supplies a probability distribution, 

F, typically consisting of model output composited over time and/or space, and we 
wish to evaluate it against the empirical measure, Gk, associated with the corresponding 
observations. The evaluation then is usually performed by computing a distance measure 
or divergence, d, between the probability distributions F and Gk, where the divergence 
function d is such that d{F, F) = and d{F, G) > for all probability distributions F and 

G. Competing climate models give rise to distinct simulated probability distributions, 
and those that yield the smallest divergences when compared to the empirical measure 
are considered to perform best. 

Myriads of distance measures for probability distributions have been studied in the 
literature, as reviewed comprehensively by Deza and Deza [6l Chapter 14], and it is far 
from obvious which ought to be used for evaluation purposes. Our paper aims to provide 
guidance in the choice of the divergence function in a general setting, including both 
categorical and continuous, real- or vector-valued observations. The task is not unique 
to climatology, and essentially the same problem arises in the emerging transdisciplinary 
field of uncertainty quantification. In this latter context. Person et al. [7, Section 5.2] list 
desirable properties of a divergence function d in the case of a real-valued observation, in 
thatH 

(a) the divergence d{6x,Sy) between point measures 6x and 6y ought to reduce to the 
absolute error \x — y\; 

(b) the divergence ought to be sensitive to all distributional facets, rather than just 
means and variances; 

(c) the divergence ought to be expressible in physical unit^ 

(d) the divergence function ought to be "mathematically well behaved and well under- 
stood" . 

Our goal here is to formalize property (d), by proposing and studying a propriety condi- 
tion for divergence functions that resembles the classical propriety condition of Winkler 
and Murphy [26] for scoring rules. Specifically, suppose that F is a predictive probability 
distribution for a single future quantity or event, y. In this setting, predictive performance 
is assessed by assigning a numerical score, s{F,y), based on the predictive distribution, 
F, and the verifying value, y. We suppose that scores are negatively oriented, that is, the 
smaller, the better, and we write s{F, G) for the expectation of s{F, y) when y is a. random 
variable with distribution G. The scoring rule s is said to be proper if s{G, G) < s{F, G) 
for all F,G E J^, where J-" is a suitable class of probability measures. In other words, 
a proper scoring rule is such that the expected score or penalty is minimized when the 
predictive distribution, F, agrees with the true distribution, G, of the quantity or event to 

^Person et al. [3 Section 5.2] furthermore request the unboundedness of the divergence function. We 
suppress this request, as the compatibihty property (a) imphes unboundedness. 

^For example, if we are concerned with temperature simulations in degrees Celsius, then the divergence 
should have a value in degrees Celsius. 
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be forecast. As the use of proper scoring rules encourages honest and careful assessments, 
and is deeply entrenched in first decision theoretic principles, it is considered essential in 
scientific and managerial practice |H |9] . 

2 Notions of Propriety for Divergences 

Divergence functions address situations where the prediction takes the form of a probabil- 
ity distribution F within a convex class, J-", of probability measures on a general sample 
space VL. The predictive distribution then needs to be compared against observations 
Ui, . . . ,yk and the corresponding empirical measure, 

1 

Gk = -j^^^y^^ 

1=1 

where 6y. is the point measure in the observation yi & Q. A divergence then is a function 

d : J" X J"^ M 

such that d{F, F) = and d{F, G) > for all probability distributions F,G E J-'. Diver- 
gences are sometimes called divergence functions or discrepancy functions, or validation 
metrics [H], even though they are typically not metrics in the mathematical sense. We 
generally assume that the convex class J-" contains all empirical measures, and we write 

^Gd{F,Gk) 

to denote the expectation of d{F, Gk) when the predictive distribution is F G J-", and the 
observations . . . , G f2 in the empirical measure are independent with distribution 

G G 

Definition 1. A divergence function d : x T ^ is k -proper for a positive integer k 

EGd{G,Gk)<^Gd{F,Gk) (1) 
for all probability distributions F,G E J^. It is asymptotically proper if 

limfc^oo EadiG, Gk) < lim^^oo Ego?(F, Gk), (2) 

for all F,GeT. 

Thus, if the divergence is fc-proper, and we believe that the observations form a 
sample from the probability distribution G, then an optimal strategy is to act on one's 
true beliefs. In this sense, the use of proper divergence functions encourages honest and 
careful assessments. 

Technically, the propriety condition ([T]) relates closely to the corresponding condition 
for scoring rules. Recall that a scoring rule s : J-" x f2 — )• M is proper if 

s{G,G) = EGs{G,y) < EGs{F,y) = s{F,G) 

for all probability distributions F,G E J^. Thus we see that if the divergence d is 1- 
proper, the scoring rule s{F, y) = d{F, 6y) is proper. Conversely, if s is a proper scoring 
and s{G, G) is finite for all G E then 

d{F,G) = s{F,G)-s{G,G) 
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is a divergence function, which we refer to as the score divergence associated with the 
proper scoring rule s. 



Theorem 2. If d : T x T ^ is a score divergence, then d is k -proper for all positive 
integers k. 

Proof. If (i is a score divergence, then there exists a proper scoring rule s such that 

d{F,Gk) = s{F,Gk)-s{Gk,Gk), (3) 

where 

k 

.(F,G,) = ^^s(F,y,), (4) 

i=l 

and s{Gk, Gk) denotes the expected score when the predictive distribution is the empirical 
measure and the observation is drawn from the latter in a bootstrap-like manner. Using 
these conventions, 



E, 



G 



^Gs{Gk.G, 



k 



¥.Gd{G, Gk) = Egs(G, Gk) - Egs(G,, Gk) 

k 

\ll<G.Y,) 

i=l 

= s{G,G)-¥.Gs{Gk^Gk) 
<s{F,G)-¥.Gs{GkMk) 
= EGd{F,Gk), 

which is the desired expectation inequality. □ 

A score divergence in fact satisfies the defining inequality ([T]) whenever the obser- 
vations Hi, . . . ,yn are identically distributed, with or without an assumption of indepen- 
dence. In practice, the score divergence ^ and the raw score Q yield the same rankings, 
though the divergence might be preferred, as it is nonnegative and anchored at an ideal 
value of zero. 

A usefully general construction proceeds as follows. Let h : Q x Q ^ [0, oo) be a 
negative definite kernel, i.e., h{x, y) = h{y, x) for all x,y E Q, and for all positive integers 
n, all ai, . . . , a„ G M that sum to zero, and all Xi, . . . , a;„ G it is true that 

EILi aittjhixi, Xj) <0. 

Subject to natural moment conditions a proper scoring rule can be defined as 

y) = ^ Eph{x, x') - Ephix, y), (5) 

where x and x' are independent random variables with distribution F [9j. In the subse- 
quent sections, we will identify various popular divergence functions as score divergences 
associated with proper scoring rules of this form. 

Clearly, if the divergence d is /c-proper for all positive integers k, then it is asymptot- 
ically proper. However, asymptotic propriety arises under rather weak conditions. 

Theorem 3. If for each G E J-' there exists a constant cg such that d{G,Gk) < cg for 
all positive integers k and d{G, Gk) — > almost surely, then d is asymptotically proper. 
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Proof. By the Lebesgue Dominated Convergence Theorem, 

hmfc^oo Eg d{G, Gk) = Eg hm^^oo d{G, Gk) = 0. 

In view of the divergence d being nonnegative, the defining inequahty ^ holds for all 
F,G eT. □ 

In applications, the number k is often naturally limited by the number of time points 
or spatial locations at a fixed scale within the study domain. In such cases, the asymptotic 
properties of the divergence function will never be attained. Therefore, it is important to 
rank competing models by using a divergence that is fc-proper, rather than asymptotically 
proper. 



3 Continuous Outcomes 

We now discuss the propriety of commonly used divergence functions in the case of 
continuous outcomes or observations. In this setting we assume that Q C M™, and we let 
J-p denote the convex class of the Borel probability measures on fl with finite absolute 
moments of order p > 0. 



3.1 Integrated Quadratic Distance 

We initially consider real-valued outcomes and identify a probability distribution F on 
i7 = M with its cumulative distribution function (CDF). The integrated quadratic distance 
is then defined as the integral 

/oo 
(F(t)-G(t))'dt, (6) 
-oo 

over the squared difference between the corresponding CDFs, when evaluated at all 
thresholds. 

If we restrict attention to the class J^i of the probability measure with finite first 
moment, well known results P,!23] allow us to identify the integrated quadratic distance as 
the score divergence associated with a kernel score of the form ([s]), where h(x,y) = \x — y\, 
namely the continuous ranked probability score [H |16], which is defined by 

/oo 
{F{x) - l{y < x}f dx = - Ef\x - x'\ - Ef\x - y\, (7) 
-oo ^ 

where l{y < x} denotes an indicator function. This proves the fc-propriety of the in- 
tegrated quadratic score for all positive integers k. Indeed, the score satisfies all the 
properties listed in the introduction, with property (a) being immediate from equation 
(|6| and property (c) being implied by the kernel score representation ([T]). 

Salazar et al. [22] and Raisanen and Raty [21] apply the continuous ranked probability 
score to rank climate model predictions based on the corresponding mean score of the 
form (|4]). As noted, this yields the same ranking as applying the integrated quadratic 
distance (|6| to the empirical measure of the observations. 

A possible generalization is to the class of weighted integrated quadratic distances of 
the form 

d^iQ{F,G)= [ {F{t)-G{t)fw{t)dt, 
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where w is a nonnegative weight function on f2 C M™-, and where we again think of F 
and G as CDFs. An interesting question is that for weight functions that yield proper 
divergences in cases other than that of the integrated quadratic distance, in which we 
have m = 1 and w{t) = 1. 

3.2 Mean Value Divergence and the Optimal Fingerprint Method 

In the optimal fingerprint method [16-19] a model generated climate change signal is said 
to be detected if its amplitude in the observations is large compared to the amphtude of 
unforced model output signal or uncontaminated data. The amplitude is measured by the 
square of the Mahalanobis distance [T3] using the covariance matrix of the natural climate 
variability. In our setting, we consider the divergence between probability measures F 
and G on that is defined by the square, 

d{F, G) = (hf - yUc)' - hg), 

of the Mahalanobis distance, where the positive definite matrix S G M'*"^™ may depend 
on G but not on F. This distance depends on the predictive distribution F only through 
the mean vector fip & I^™, and it is readily seen that 

EcdiF, Gk) - EcdiG, Gk) = (fiF - I^g)' T.~\^if - f^c) > 0, 

whence d is A;-proper for all positive integers k. In particular, if we take S to be the 
identity matrix, we obtain the mean value divergence, 

dMv{F,G) = {hf - IJ^cYil^F - fJ'c)- (8) 

The variant defined by d{F,G) = {fiF - fic)' ^F^f^F - /^g), where ^f e R"'''"' is the 
covariance matrix of F, fails to be proper, as we can diminish the expected divergence 
by artificially inflating E^?. 

3.3 Dawid-Sebastiani Divergence 

Dawid and Sebastiani |5^J study score divergences between probability measures F and G 
on that depend on the mean vector, fiF £ R"^, and the covariance matrix, S^? G M"^^"* 
of the forecast distribution only. A prominent example is the Dawid-Sebastiani divergence, 

dBsiF, G) = tr(S^iSG) - logdet(S^iSG) + (/iF - fiGY^F^f^F - /ic) - m, (9) 

which arises as the score divergence associated with the proper scoring rule O |9] 

siF,y) = -logdetSiT - {fiF - y)'T.~p{^iF - y)- 

The expression on the right-hand side equals the non-normalized log-likelihood function 
for a multivariate normal density, whence dx^siF, G) is equivalent to the Kullback-Leibler 
divergence between multivariate normal distributions with mean vectors and covariance 
matrices equal to those of F and G. 
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3.4 Area Validation Metric and Wasserstein Distance 



We now return to real- valued outcomes and identify a probability distribution F onil 
with its CDF. In this setting, Person et al. [7j propose the area validation metric, 



/oo 
|F(t)-G(t)|dt, 
-oo 



as a divergence function. This resembles the integrated quadratic distance (|6| and satisfies 
the desirable properties (a), (b) and (c) stated in the introduction. 

Furthermore, by Theorem 3 and standard results in the theory of empirical processes 
[23 P- 66], the area validation metric is asymptotically proper as a divergence measure for 
probability distributions with finite first moment. Nevertheless, property (d) is violated, 
as dAv generally fails to be /c-proper. For example, let G be the uniform distribution on 
[0, 1], and let Fk be discrete with probability mass 1/k in x = i/ {k + 1) for i = 1, . . . , k. 
Then 



I = EcdMFi, Gi) < EcdAviG, G^) = l 

and E,GdAY{Fk,Gk) < E,GdAY{G,Gk) for A; < 25 in Monte Carlo experiments, thereby 
suggesting that the dAv encourages underdispersed model simulations. 

The area validation metric can be identified with a special case of the transportation 
distance or Wasserstein distance [21 E] of order p > 1, namely 

i/p 

dp{F,G)= ( / \F{t)-G{t)\Pdt' 



\ J — oo 

(^^\f-^{u) - G-^{u)\^ dn^ ' 
[mi Ef,g\x - y 



IP 



where F^^{u) = mi{t : F(t) > u} and the infimum is taken over all jointly distributed 
random variables x and y with marginal distributions F and G, respectively. While the 
Wasserstein distance dp is asymptotically proper as a divergence measure within the class 
J-'p, it generally fails to be fc-proper for finite k. 



3.5 Kolmogorov-Smirnov Distance 

Considering again real-valued outcomes, the Kolmogorov-Smirnov distance, 

dKs{F,G) = snp,^^\Fit)-Git)\, 

where F and G are interpreted as CDFs, is commonly used as a divergence function. By 
the Glivenko-Cantelli Theorem and our Theorem 3, the Kolmogorov-Smirnov distance 
is asymptotically proper. However, it is generally not A;-proper. For example, if G is 
uniform on [0, 1] and F^ is discrete with mass 1/(A; + 1) in i/k for i = 0, ... ,k, then 

I = EcdKsiFi, Gi) < EcdKsiG, Gi) = |, 

and EcdKsiFk, Gk) < EcdxsiG, Gk) for A; < 5 in Monte Carlo experiments. We thus have 
reservations about the use of the Kolmogorov-Smirnov distance for ranking competing 
models; however, we hasten to add that any concerns about the use of improper divergence 
functions are task dependent and may not apply to testing problems. 
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4 Categorical Outcomes 



We now consider categorical outcomes or observations. Without loss of generality, we may 
assume that the sample space is the finite set f2 = {l,...,c}. A probability distribution F 
on can then be identified with a probability vector (/i, . . . , /c). Similarly, the empirical 
measure Gk can be identified with a vector of the form (^^i, . . . , ^kc)' ■, where 'gkj = H^{yi = 
i -.1 = l,...,k}/k for j = l,...,c. 

4.1 Kullback-Leibler Divergence 

The Kullback-Leibler divergence is a commonly used asymmetric measure of the difference 
between two probability vectors, namely 

dKL(F,G) = V/-,log^. (10) 

The Kullback-Leibler divergence is the score divergence associated with the proper loga- 
rithmic score [U |9] and so it is fc-proper for all positive integers k by Theorem 2. 

4.2 Brier Divergence 

The quadratic or Brier divergence, 

c 

d^{F,G) = Y,{h-9i)' (11) 

i=l 

is the score divergence associated with the Brier score O [H] , and thus it is /c-proper for 
all positive integers k. 

4.3 Hellinger Distance 

Palmer [19] suggests the use of the Hellinger distance, 




to compare probability distributions derived from climate model output to the corre- 
sponding data distributions. While the Hellinger distance is asymptotically proper by 
Theorem 3, it generally fails to be /c-proper. When c = 2 and F and G are represented 
by the probability vectors (/i, 1 — /i) and {gi, 1 — gi), respectively, we find that 

EGrfH(F,Gi) = ^71^/1- V7^+(l-^7i)v/l- V^T^, 

which generally does not attain a minimum when the forecast probability /i equals gi. 
For instance, if /i = 0.10 and gi = 0.25, Monte Carlo experiments show that 

EGdn{F,Gk)<EGdniG,Gk) 

for A; = 1,2,5,6 and 10. 
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(a) IQ distance from ERA-40 



(b) IQ distance from NCEP-1 





(c) MV divergence from ERA-40 



(d) MV divergence from NCEP-1 



Figure 1: Grid box specific divergences of yearly maximum temperature simulations 
for the period 1961-1990 by tlie ECHAM5/MPI-0M model in terms of the integrated 
quadratic (IQ) distance, in the unit of degrees Celsius, and the mean value (MV) diver- 
gence from the re-analysis data ERA-40 and NCEP-1. 



5 Case study: Temperature Extremes Over Europe 

Anthropogenic climate change has awaken strong scientific, societal and policy interest in 
predicting future climate evolution. The simulation of climate elements is usually carried 
out by coupled atmosphere-ocean circulation models. The output from the models is 
then used to provide insight into future climate states. In this illustration, we focus on 
annual maxima of 2m (surface) temperature. We evaluate 15 different climate models by 
comparing simulations of past climate to corresponding re-analysis data, where historical 
weather observation data are used to reconstruct realized states of the atmosphere on a 
global grid, thereby facilitating direct comparison to climate model output. 

Our study region consists of large parts of Europe (8W-40E, 32N-74N) as illustrated 
in Figure [T} The climate model projections are obtained from the Coupled Model In- 
tercomparison Project phase 3 (CMIP3) multi-model dataset [18]. The various models 
operate on distinct global grids. To allow for a grid based comparison, we transform to 
a common grid of size 144 x 73, as for the re-analysis data. At this resolution, our study 
region contains a total of 154 grid boxes. As observational data, we work with the ERA- 
40 re-analysis [25] by the European Centre for Medium-Range Weather Forecasts and the 
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Figure 2: Evaluation of yearly maximum temperature simulations for the period 1961- 
1990 by the 15 CIMP3 climate models in terms of the integrated quadratic (IQ) distance, 
in the unit of degrees Celsius, and the mean value (MV) divergence from the re-analysis 
data ERA40 (blue bars) and NCEP (red bars). The bars show the average divergence 
over the 154 grid cells in Figure [1} 



NCEP-1 re-analysis [I3] by the U.S. National Centers for Environmental Prediction. In 
our comparison, we only use a single simulation from each of the 15 climate models; if 
multiple simulations are available, one instance is selected at random. Both the climate 
models and the re-analyses provide daily outputs for 2m maximum temperature. At each 
grid box, the climate model simulation then is represented by the location-specific em- 
pirical distribution function of the resulting annual maxima over the study period from 
1961 to 1990. 

Figure |2] shows the integrated quadratic distance ^ in the unit of degrees Celsius and 
the mean value divergence ^ of the simulations from the two sets of re-analysis data, 
with each value being an average over the 154 grid boxes. The divergences result in similar 
rankings relative to a fixed re-analysis data set, while we see larger differences between 
the re-analyses. For three models, the divergences between the model and the re-analysis 
data are at the level of the internal variability between the ERA-40 and the NCEP-1 
re-analyses, which is the best performance one might reasonably hope for. These are the 
ECHAM5/MPI-0M model from the Max Planck Institute for Meteorology in Germany, 
the GFDL-CM2.1 model developed by the Geophysical Fluid Dynamics Laboratory in 
the United States and the MRI-CGCM2.3.2 model from the Meteorological Research 
Institute of Japan. The first two models have a slightly higher spatial resolution than 
the re-analyses while the MRI-CGCM2.3.2 model operates on a grid of size 128 x 64. 

In Figure [l| we can identify the location-specific divergences for the ECHAM5 /MPI- 
OM model under both sets of re-analysis data. Any differences between the two re- 
analyses seem to be more pronounced in coastal regions, in particular around the Mediter- 
ranean Sea and on the Eastern shore of the Baltic Sea. As the two types of divergences 
operate on different scales, the results for the integrated quadratic distance cannot be 
compared directly to the mean value divergence. However, the results seem to indicate 
that there is more local variability under the integrated quadratic distance, as is to be 
expected, given that it addresses all distributional features, as opposed to the mean value 
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divergence. 



6 Discussion 

We have introduced the notion of propriety for divergences, i.e., distance measures be- 
tween probabihty distributions. In a nutshell, if the divergence is proper, and we believe 
that the observations form a sample from the probability distribution G, then an optimal, 
expectation minimizing predictive distribution is G. This property is important in set- 
tings such as that of Figure 2, where we compared climate models based on the average 
divergence between the probability distributions of model output and the correspond- 
ing empirical distributions of temperature extremes. For a proper divergence function, 
the average divergence criterion favors skillful climate models in decision theoretically 
coherent ways. 

In particular, score divergences that derive from proper scoring rules are fc-proper for 
all positive integers fc, with the integrated quadratic distance (|6|, the Dawid-Sebastiani 
divergence ([9]) and the Kullback-Leibler divergence (10) being particularly attractive 



choices, while other commonly used distances fail to be proper. In the case of real-valued 
variables, the integrated quadratic distance is the only divergence available that satisfies 
the desirable properties listed in the introduction, and thus we endorse and encourage its 
use. 

The evaluation of model simulations continues to provide challenges. An important 
caveat is that if we evaluate model simulations for a univariate quantity, such as an an- 
nual temperature maximum, by using a divergence function, the assessment is restricted 
to univariate aspects. If multivariate aspects, such as the behavior of temperature ex- 
tremes at several sites simultaneously, are of interest, divergence functions for probability 
distributions in higher-dimensional spaces are to be considered. 
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